Genetic Diversity and Spatiotemporal Dynamics of Chikungunya Infections in Mexico during the Outbreak of 2014–2016

Chikungunya virus (CHIKV) is an alphavirus transmitted by Aedes mosquitoes, which causes Chikungunya fever. Three CHIKV genotypes have been identified: West African, East-Central-South African and Asian. In 2014, CHIKV was detected for the first time in Mexico, accumulating 13,569 confirmed cases in the following three years. Studies on the molecular diversification of CHIKV in Mexico focused on limited geographic regions or investigated only one structural gene of the virus. To describe the dynamics of this outbreak, we analyzed 309 serum samples from CHIKV acute clinical cases from 15 Mexican states. Partial NSP3, E1, and E2 genes were sequenced, mutations were identified, and their genetic variability was estimated. The evolutionary relationship with CHIKV sequences sampled globally were analyzed. Our sequences grouped with the Asian genotype within the Caribbean lineage, suggesting that the Asian was the only circulating genotype during the outbreak. Three non-synonymous mutations (E2 S248F and NSP3 A437T and L451F) were present in our sequences, which were also identified in sequences of the Caribbean lineage and in one Philippine sequence. Based on the phylogeographic analysis, the viral spread was reconstructed, suggesting that after the introduction through the Mexican southern border (Chiapas), CHIKV dispersed to neighboring states before reaching the center and north of the country through the Pacific Ocean states and Quintana Roo. This is the first viral phylogeographic reconstruction in Mexico characterizing the CHIKV outbreak across the country.

The first cases of CHIKF were described in Tanganyika (present-day Tanzania) during the 1950s [9]. These were followed by some sporadic outbreaks in Africa and Asia [10].

Virus Isolation, Reverse Transcription-Polymerase Chain Reaction Amplification, and Nucleotide Sequencing
A total of 309 CHIKV serum samples from 15 Mexican states (Table 1), obtained between 2015 and 2016, were provided by the central laboratory of epidemiology of the Mexican Social Security Institute (IMSS).
The use of hypervariable regions to elucidate phylogenetic relationships at low taxonomic levels has proven successful [31,32]. For the present phylogenetic analysis, we chose to sequence three hypervariable regions of the CHIKV genome. These regions were selected through a genetic variability analysis of 368 whole genome aligned sequences from the Virus Pathogen Database and Analysis Resource (ViPR) of the National Institute of Allergy and Infectious Diseases (NIH/DHHS). To identify regions with the greatest variability, the number of nucleotide changes and the ratio of synonymous and non-synonymous mutations were determined for each codon in forty-five non-overlapping nucleotide-sliding windows. The regions with the highest number of mutations and highest ratios of synonymous and non-synonymous mutations (the amino terminal region from the NSP3 gene, a central region comprising part of the domains I and II of the E1 gene, and the B and C domains of the E2 gene) were chosen for subsequent analysis ( Figure S1). The preparation of CHIKV sequence libraries from serum samples followed the instructions of Preparing 16S Ribosomal RNA Gene Amplicons for the Illumina MiSeq System workflow [33]. Viral RNA was extracted from 140 µL of serum using QIAamp Viral RNA Mini Kits (Qiagen, Germany) and the genome regions were amplified using primers containing overhang adapters in a Triplex one-step PCR reaction ( Figure S2a). Reverse transcription was performed using Superscript III Reverse Transcriptase (Invitrogen) and amplified by polymerase chain reaction (PCR) using Platinum Taq DNA polymerase (Invitrogen) (For primers and PCR conditions see Table S1). The three amplicons corresponding to NSP3 (514 nt), E2 (547 nt) and E1 (479 nt) amplified regions were visualized in 2% agarose gels ( Figure S3) and subsequently purified using the Agencourt AMPure Beads ® system (Beckman Coulter, (United States, Indiana, Indianapolis)). Biotinylated magnetic AMPure beads allow for selection of specified cDNA products bound to streptavidin. 50 µL of amplified cDNA from samples were mixed and purified two times with AMPure XP beads at a 1.8:1 ratio (beads:sample), this ratio allows for optimal selection of all products higher than 100 nt. The three amplicons of each sample were pooled according to the location and date of sampling in 24 pools of a maximum of 20 samples for each library (Supplementary Archive S1) The Illumina system adapters containing indexes (known short sequences used to identify the sequences belonging to each library) were added using PCR reactions ( Figure S2b) (For index and adapters see Table S2). The libraries were sequenced in the MiSeq platform (Illumina, (United States, California, San Diego)), at the Mexican National Institute of Genomic Medicine (INMEGEN, Mexico, CDMX, Mexico City), with a 600 cycles V3 kit with a paired-end sequencing configuration, to obtain 300 bp paired end overlapping reads, following the manufacturer's instructions ( Figure S2c).

Data Sets
Adapters and poor quality reads (minimum Phred quality of 30) were removed using Trimmomatic software (Version 0.32, USADELLAB, Aachen, Alemania) [34]. The remaining sequences were aligned to the reference genome NC_004162.2 using Bowtie2 software (version 2.3.4.1, SOURCEFORGE, San Diego, California) [35]. Each data set was trimmed to a common length and the sequences of the regions of NSP3, E1, and E2 genes of each sample were concatenated in a single sequence of 1341 nucleotides, and the amplicon sequence variants (ASVs) within each pool were determined by clustering centroids with 100% identity using the VSEARCH software (Version 2.13.6, UNINETT, Oslo, Norway) [36] ( Table S3). The variants with the highest number of readings in each pool were accumulated until reaching an effective sample depth of 95% [37]. To prevent overrepresentation of some locations and dates, we constructed a preliminary MCC tree ( Figure S4), sequences with the same location and date that were grouped in the same clade were collapsed into a single sequence.

Phylogenetic Analysis
As a pre-processing step, sequence recombination was screened using GARD (Genetic Algorithm for Recombination Detection) [38] available in the Datamonkey web server. Clustal Omega was used for multiple sequence alignment based on the Percent nucleotide identities (PNI) calculated using p-distances [39]. The best-fit model of nucleotide substitution was selected based on the Bayesian Information Criterion (BIC) available in ModelTest 3.5 [40]. The GTR + G + I model (general time-reversible model with gamma-distributed rates of variation among sites and a proportion of invariable sites) was found to be the best-fit model.
The temporal information of the sequence data was used to estimate the evolutionary rate and the time to the most recent common ancestor (TMRCA), by generating a MCC tree, using the Bayesian Markov Chain Monte Carlo approach, as implemented in BEAST2 2.6.4 [41]. For this, we employed both strict and relaxed (uncorrelated exponential and uncorrelated lognormal) clock [42] models with the Bayesian Skyline tree prior. Three independent runs of the Bayesian Markov Chain Monte Carlo were carried out, each with at least 100 million generations and a sampling frequency of 10,000. The posterior probability and marginal likelihood of the models were used to choose the most suitable model for the data [43]. Tracer 1.5 assessed the convergence of the chain and the MCC tree was visualized in FigTree (Version 1.2.3, University of Edinburgh, Edinburgh, United Kingdom). The Bayes Factor analysis indicated that the uncorrelated exponential clock model fitted better than the strict clock or uncorrelated lognormal clock model. The corresponding output files generated by BEAST were utilized for further analysis.

Phylogeographic Inference
Phylogeny-trait association tests (AI, Association Index, and PS, Parsimony Score) available in BaTS [44] were performed to evaluate the association between phylogeny and geographical locations of the sequence data and hence the suitability for phylogeographic analysis. The rate of nucleotide substitution per site, per year (subs/site/year), the time to the most recent common ancestor (TMRCA), and the spatial diffusion rates (i.e., the rate at which viral lineages move among sampled locations) were jointly estimated from the date and location of each CHIKV sequence. For this, we used a Bayesian approach implemented in the Markov chain Monte Carlo (MCMC) inference framework of the BEAST v2.6.5 software package [45]. Analyses were carried out using a general time reversible (GTR) model with a discretized gamma-distributed across-site rate variation (GTR + I + Γ4) substitution model and a relaxed uncorrelated lognormal molecular clock model. MCMC was run sufficiently enough to ensure stationarity. The convergence of parameters was assessed by calculating the Effective Sample Size (ESS) using TRACER [46]. Maximum clade credibility (MCC) trees were summarized using TreeAnnotator v1.8 and visualized with FigTree v1.4.2. An MCC tree is a point-estimate characterizing the posterior distribution of trees and represents the tree topology yielding the highest posterior probabilities of individual clades. The branch lengths in these MCC trees are posterior median estimates [47]. Further, the tree nodes were annotated with their most probable (modal) location via color labeling. The MCC tree obtained was the input to the program SPREAD 1.0.3 [48] to visualize and analyze the dispersion pathways.

Nonsynonymous Mutations in Mexican CHIKV Sequences
We obtained between 711 and 5228 different concatenated sequences per pool, corresponding to the NSP3, E2, and E1 partial genes. After choosing the representative sequences from each pool (the sequences with the highest number of reads), we obtained 238 new sequence variants (Between 5 and 19 new sequence variants for each pool). After collapsing the sequences with the same location and date that were grouped within the same node according to the preliminary MCC tree ( Figure S4), 59 variant sequences remained. Eleven were isolated from patients living in the states of Baja California Sur, 2 from Baja California, 5 from Chiapas, 9 from Colima, 3 from Mexico City, 1 from Guerrero, 1 from Mexico State, 2 from Michoacan, 2 from Nuevo Leon, 6 from Oaxaca, 3 from Quintana Roo, 3 from Sinaloa, 3 from Tabasco, 6 from Veracruz, and two from Yucatan ( Figure S5).
The nucleotide and amino acid identity of our sequences were 99.50-100% and 99.35-100%, respectively. Compared to the first Mexican sequences reported in October 2014 [23], our sequences depicted 45 mutations in the NSP3 gene region (22 synonymous and 23 non-synonymous), 44 in E1 (20 synonymous and 24 non-synonymous), and 42 in E2 (19 synonymous and 23 non-synonymous). All non-synonymous mutations were unique to each isolate ( Figure S6 and Supplementary Archive S2). Compared to the ECSA genotype, our sequences depicted 14, 7, and 4 non-synonymous mutations in the proteins NSP3, E2, and E1, respectively. Compared to the Asian genotype, four non-synonymous mutations in protein NSP3 and one non-synonymous mutation in protein E2 were identified. Three non-synonymous mutations (E2 S248F and NSP3 A437T and L451F) were present in our sequences, the Caribbean lineage and one Philippine sequence ( Table 2).

Phylogenetic Analysis
We did not find recombinant sequences in the pre-processing phylogenetic analysis step. To assess the evolutionary relationship between our 59 new sequences and CHIKV sequences sampled globally, we aligned our sequences with those from four of the first cases detected in Mexico [23] and 48 CHIKV ECSA and Asian genotypes sequences obtained from GenBank. The relationship of the CHIKV sequences was examined using a MCC tree reconstructed with concatenated NSP3, E2 and E1 partial genes (1341 nucleotides). Our phylogenetic analysis resulted in a MCC tree that resolved the ECSA and Asian genotypes with a posterior probability of 1, the IOL clade with posterior probability of 1 and the Caribbean epidemic lineage with posterior probability of 1 ( Figure 1). Isolates from the Indian Ocean and Caribbean outbreaks conformed monophyletic clades in the ECSA and Asian genotypes, respectively. Our 59 new sequences clustered in three different clades within the Caribbean lineage of the Asian genotype. Other sequences reported from other countries in the Caribbean, Central and South America, from 2013 to 2015, were in the same cluster. These were closely grouped with one sequence of Saint Martin sampled at the beginning of the outbreak in the Americas in 2013 (posterior probability of 1) and one Philippine sequence from 2013 (posterior probability of 1) (Figure 1).   Tan

Phylogeographic Inference
The phylogeny-trait association tests indicated a strong association between sampling location and phylogeny, supporting the suitability of phylogeographic analysis. Among the locations analyzed, coast locations (B.C.S., Chiapas, Colima, Guerrero, Oaxaca, Quintana Roo and Veracruz) showed stronger phylogenetic clustering (p-value = 0.01) (Table S4).
To investigate the relationship among our sequences, we used a maximum clade credibility approach. The estimated substitution rate was 2.3 × 10 −3 substitutions per site per year (95% higher posterior density (HPD): 1.44-3.1 × 10 −3 subs/site/yr.). The estimated date of the most recent common ancestor (tMRCA) was January 2014, with a 95% HPD between April 2012 and August 2014.
The maximum clade credibility phylogeographic tree divided the Mexican sequences into three main clades. In the first (Clade A), sequences from the south of the country (Tabasco, Quintana Roo, Chiapas, Oaxaca, and Veracruz) and the North Pacific (Sinaloa and Baja California Sur) were grouped together. The most probable root state location for this node was one strain obtained in Quintana Roo (location probability of 0.69). The Baja California Sur sequences were grouped into a single clade, from which a Nuevo León sequence emerged (location probability 0.76) (Figure 2). The second clade (Clade B) grouped sequences from the Pacific coast (Guerrero, Colima, Michoacan, Sinaloa, Baja California, and Baja California Sur), center (Mexico City and Mexico State) and north (Nuevo Leon and Tamaulipas) of the country. The most probable root state location for this node was one strain obtained in southeast Chiapas in October 2014 at the beginning of the outbreak (location probability of 1). The next internal nodes indicated Guerrero and Colima as the most probable locations (location probability of 0.51). After this, the descending sequences were grouped into two clades: one grouping sequences from the north Pacific coast (Michoacan, Sinaloa, and BCS) and the other grouping sequences from the center of the country (Mexico State and Mexico City) along with one sequence from Tamaulipas ( Figure 2).
The third clade (Clade C) grouped sequences from the south of the country (Yucatan, Tabasco, Quintana Roo, Chiapas, Oaxaca, and Veracruz) and the North Pacific (Colima, Sinaloa and Baja California Sur). One sequence which was the most probable root state location for this clade (location probability of 0.67) was obtained in southern Chiapas, in November 2014 at the beginning of the outbreak [24]. The clade was divided into two groups: the first grouping sequences from Chiapas, Tabasco, and Yucatán with the most probable root location in Quintana Roo (location probability 0.67), and the second with the most probable root location in Oaxaca (location probability 1.0) grouping sequences from Chiapas, Veracruz, Colima, Sinaloa, and Baja California Sur. The Jalisco sequence belonging to an imported case occurred in May 2014 [23] grouped in Clade C. However, it did not spread to other locations (location probability of 1.0) ( Figure 2).
Transformed data from the maximum clade credibility phylogeographic into a phylogeographic model, suggested that CHIKV spread from Chiapas to Oaxaca, Guerrero, and Quintana Roo during the first three months of 2015 (Figure 3a). By June 2015, the virus spread through the south and center of the country, reaching Veracruz, Mexico State, and Mexico City from Quintana Roo and went along the Pacific coast to Colima (Figure 3b   The second clade (Clade B) grouped sequences from the Pacific coast (Guerrero, Colima, Michoacan, Sinaloa, Baja California, and Baja California Sur), center (Mexico City and Mexico State) and north (Nuevo Leon and Tamaulipas) of the country. The most probable root state location for this node was one strain obtained in southeast Chiapas in October 2014 at the beginning of the outbreak (location probability of 1). The next internal nodes indicated Guerrero and Colima as the most probable locations (location probability of 0.51). After this, the descending sequences were grouped into two clades: one grouping sequences from the north Pacific coast (Michoacan, Sinaloa, and BCS) and the other grouping sequences from the center of the country (Mexico State and Mexico City) along with one sequence from Tamaulipas ( Figure 2). The third clade (Clade C) grouped sequences from the south of the country (Yucatan, Tabasco, Quintana Roo, Chiapas, Oaxaca, and Veracruz) and the North Pacific (Colima, Sinaloa and Baja California Sur). One sequence which was the most probable root state location for this clade (location probability of 0.67) was obtained in southern Chiapas, in November 2014 at the beginning of the outbreak [24]. The clade was divided into two groups: the first grouping sequences from Chiapas, Tabasco, and Yucatán with the most probable root location in Quintana Roo (location probability 0.67), and the second with the most probable root location in Oaxaca (location probability 1.0) grouping sequences from Chiapas, Veracruz, Colima, Sinaloa, and Baja California Sur. The Jalisco sequence belonging to an imported case occurred in May 2014 [23]

Discussion
Previous studies on the molecular diversification of CHIKV in Mexico were restricted to limited geographic regions of the country, such as Chiapas [25] and Tamaulipas [27]. Herein we extended these studies with samples collected across Mexico.

Discussion
Previous studies on the molecular diversification of CHIKV in Mexico were restricted to limited geographic regions of the country, such as Chiapas [25] and Tamaulipas [27]. Herein we extended these studies with samples collected across Mexico.
The standard approach, using only one structural gene of the virus [26,29], to study molecular diversification has been shaped by the relative technical easiness for increasing the number of taxa and the desire to study as many taxa as possible. However, the use of small amounts of phylogenetic signal generates phylogenetic hypotheses that are incongruent or lack support [56]. For instance, a maximum likelihood analysis that used a 1044 bp region from the E1 gene did not resolve the ECSA lineage as monophyletic [29], as has been shown in whole genome analyses [57]. As previously observed using other approaches, combined gene analyses can be superior to single-gene analyses for the resolution of internal branches and the position of taxa forming long branches [58]. Here, the sequence constructed with concatenated highly variable regions the NSP3, E1 and E2 proved to resolve the clades of the ECSA and Asiatic genotypes as monophyletic, including the IOL and Caribbean lineages.
In this study, we used 59 sequences, along with published sequences, from CHIK cases obtained during the outbreak that occurred in Mexico between 2014 and 2016. We found 70 non-synonymous mutations in our new sequences compared to the sequences sampled at the beginning of the outbreak, 23 in NSP3, 24 in E1, and 23 in E2. However, we did not observe the same mutations in isolates from different geographical regions or times, suggesting that diversifying selection does not occur. The mutation E1 A226V found in IOL-ECSA sequences that confers increased infectivity to Ae. albopictus and favors the dissemination of CHIKV but has inconsistent effects in the infectivity to Ae. Aegypti [59,60], was not present in our samples. However, the E1 K211E mutation, occurring in all our samples, was reported to increase virus fitness in Ae. aegypti, increasing the virus infectivity, dissemination and transmission, compared to the E1 A226V virus [61]. In the same way, three non-synonymous mutations (E2 S248F and NSP3 A437T and L451F) were present in all of our sequences, as well as in the sequences from the Caribbean outbreak [62], pointing at this geographic region as the origin of the virus that caused the Mexican outbreak, as previously inferred [26].
A recent study demonstrated that another mutation in this site (E2 S248L) was beneficial for the dissemination of the virus by Ae. albopictus [55]. Whether the amino acid substitution of E2 S248F would have a beneficial effect on the viral fitness in Ae. aegypti, the main mosquito vector in Mexico and the Caribbean region needs further investigation. There is not much information about the A437T and L451F NSP3 mutations. The A437V mutation (same site as the A437T mutation) has been reported in samples of the ECSA genotype [51], suggesting that amino acid changes at this site may be related to adaptations to Ae aegypti and Ae albopictus, as could be the case of the E2 S248F mutation.
Another study found these three mutations (E2 S248F and NSP3 A437T and L451F) in samples from the CHIKV outbreak in the Philippines in 2012 [54]. This study suggests the possibility that the strain called Cosmopolitan Asian CHIKV (CACV) responsible for outbreaks in the Caribbean had its origin in the Philippines. Our results support this hypothesis, as these mutations were present only in the Caribbean lineage (including our sequences) and one isolate from the Philippines. In addition, in our phylogenetic analysis, the sequence from the Philippines clustered closely to the clade of the Caribbean lineage.
Before assessing the spread dynamics of the sequences included in our study, we performed a phylogeny-trait association test to estimate the association between sampling location and phylogeny. Although the values for PS and AI were significant in general (p-value < 0.001), locations with the larger number of samples showed greater association (Table S4). A possible explanation for these results is that discrete trait analysis is sensitive to the relative sampling intensity of subpopulations, such that the sampling strategy can influence the results, particularly when migration rates are high [63]. Unbiased sampling can be very hard to achieve, as it requires knowledge of the full geographic range of an outbreak, access to the entire range, and extensive sampling and sequencing efforts. As our samples were obtained from clinical cases demanding attention, sampling size at different locations was not controlled and the results should be assessed with this limitation in mind.
To assess the spread dynamics of CHIKV lineages in Mexico, we analyzed the NSP3, E2, and E1 partial genes of our 59 sequences along with other sequences sampled globally. Our phylogenetic analysis resolved the genotypes ECSA and Asian with good branch support for all lineages including the IOL and Caribbean lineage. Our isolates were closely grouped with Central American sequences within the Asian genotype, more precisely in the Caribbean lineage, as reported in previous studies [23,25,26,29]. This supports the hypothesis that CHIKV reached Mexico through the Caribbean and Central America [26], where outbreaks were reported before the outbreak began in Mexico [64]. The CHIKV epidemic in the Americas derived from the re-emergent Asian lineage [21]. According to our results, the ECSA genotype previously reported in Brazil [65] was not found and the only circulating genotype during the outbreak in Mexico was of the Asian lineage. The closeness of clustering among the sequences from Central America and Mexico suggests the possibility of epidemiologically related transmission between Mexico and Central America.
A previous study documented that samples obtained in 2014 in the southeast and southern Chiapas grouped with other isolates from Nicaragua and the Caribbean [66]. According to our phylogenetic and phylogeographic analysis, the sequence isolated in southeast Chiapas (15 October 2014) could have moved through Mexico until reaching the northern border in November 2015 (Clade B), while the sequence isolated in southern Chiapas a month later (15 November 2014) spread throughout the south of the country and the Pacific coast (Clade C).
A previous study hypothesized that there was at least another introduction of CHIKV to the country, since five of their sequences from southern Chiapas grouped with sequences from Nicaragua that were obtained during 2014 and 2015 [25]. Our results also suggest a third introduction, since our sequences clustered closely with sequences from Central and South America in three different clades. In addition, sequences collected in Chiapas did not cluster with the sequences obtained early in the outbreak (two from Chiapas and one from Jalisco) and formed an independent clade with sequences from other parts of the country (Clade A).
The one sample from Jalisco isolated from an imported case at the beginning of the outbreak [23] grouped in Clade C. However, the most probable root location for his clade (Location probability of 0.67) was a sequence obtained in southern Chiapas at the beginning of the outbreak, suggesting that this introduction from an imported case did not spread throughout the country or did so in a limited way. This is the first spatiotemporal reconstruction of the evolutionary history of CHIKV across Mexico during the outbreak occurring between 2014 and 2016. CHIKV spread from the southern border in Chiapas through the Mexican Caribbean and the Pacific states, before reaching the center and north of the country. The main dissemination of the virus occurred from the south to the rest of the country. Human movements occur mainly between neighboring states, but Quintana Roo, with high local and international tourist and job attractions, possibly played a major role in the dissemination of this virus on a seasonal basis.

Conclusions
Phylogenetic information from a combined analysis of structural and non-structural genes can offer higher resolution than a region of the same size from a single gene such as E1.
The unique amino acid substitutions common among our samples and samples from other parts of the world suggest that mutations that increase the ability of the virus to replicate and spread in areas where the main vector is Ae. aegypti occurred before its introduction into Mexico. While the possible biological importance of these mutations is still unknown, the genetic signatures identified in the study represent interesting candidates for future in-depth study and epidemiological follow-up.
The phylogenetic and phylogeographic analyzes indicated that the only circulating genotype was of the Asian lineage. Multiple introductions of the virus to southern Mexico from Central America and the Caribbean spread in a short and localized way, but the state of Quintana Roo may have played a major role in the spread of the virus to other regions of the country.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board (Ethics Biosafety and Research Committee) of National Institute of Public Health of Mexico (protocol code 1138, approved on 13 November 2020).

Data Availability Statement:
The data presented in this study are openly available in GenBank (SRA) with accession number PRJNA772723.